Enable inference regret #2782
This pull request was exported from Phabricator. Differential Revision: D61930178
Codecov Report

Attention: Patch coverage is

@@            Coverage Diff             @@
##             main    #2782      +/-   ##
==========================================
- Coverage   95.68%   95.68%    -0.01%
==========================================
  Files         488      488
  Lines       47843    47943      +100
==========================================
+ Hits        45779    45874       +95
- Misses       2064     2069        +5
fc86d34 to 5b06db2
5b06db2 to e0a1512
e0a1512 to d62296c
This pull request has been merged in 75b4bf8.
Summary:
Pull Request resolved: facebook#2782

# Context

Currently, the benchmarks compute an "oracle" value for each point seen, which evaluates the point noiselessly and at the target task and fidelity, or in a way specified by the `BenchmarkRunner`. This produces an `optimization_trace` used for measuring performance. (For MOO, the hypervolume of all points tested is computed.)

While this trace does a good job of capturing whether a good point has been tested, it does not capture *inference regret*: the difference between the value of the point the model would recommend and that of the best point. This distinction becomes important (both for getting a good measure of absolute performance and for comparing methods) in contexts such as

* Bandit problems (in a noisy and discrete space), where the best point will be seen quickly; the question is when the model identifies it
* Multi-fidelity problems, where simply evaluating as many small arms as possible maximizes the current metric for optimization value
* Noisy problems, if different best-point selection strategies are being considered.
# Open questions

* Should inference value always be computed? My take: Yes; it needn't add much computational overhead, as long as evaluating the same parameterization a second time isn't expensive, because we can use a best-point selection strategy of "empirical best." Current implementation: always computes this.
* Should the "oracle trace" (the status quo behavior) always be computed? My take: Yes, because people say they find it helpful, and for consistency with the past. Current implementation: always computes this.
* If we want both, should we tag one of the two traces as "the" trace, for backwards compatibility? The current implementation does this; `BenchmarkResult.optimization_trace` is one of the `inference_value_trace` and the `oracle_trace`, with the `BenchmarkProblem` specifying which one.
* Set of best points returned for MOO: Is choosing K points and then evaluating them by hypervolume what we want?
* To what degree do we want to rely on Ax's `BestPointMixin` functionality, which is pretty stale, is missing functionality we want, requires constructing dummy `Experiment`s, and won't do the right thing for multi-fidelity and multi-task methods? An alternative approach would be to support this for MBM only, which would address or enable addressing all these issues.
* When should the trace be updated in async settings?
* This diff adds support for SOO and MOO and for `n_best_points`, but only supports SOO with 1 best point. That's a lot of infra for raising `NotImplementedError`s. Is this what we want?
* In-sample vs. out-of-sample: Currently, I'm not using these terms at all, since they are confusing in multi-task and multi-fidelity contexts. Is that what we want?
* When people develop best-point functionality in the future, would they do it by updating or adding options to `BestPointMixin._get_trace`? I wrote this under the assumption that they would either do that or use a similar method that consumes an `experiment` and an `optimization_config` and can access the `generation_strategy` used.

# This diff
## High-level changes

Technically, this adds "inference value" rather than "inference regret", because it is not relative to the optimum. That gives it the same sign as the default `optimization_trace`. It is always computed and returned on the `BenchmarkResult`. The old trace is renamed the `oracle_trace`. `optimization_trace` continues to exist; it can be either the `oracle_trace` (default) or the `inference_trace`, depending on what the `BenchmarkProblem` specifies. The `BenchmarkMethod` is responsible for specifying a best-point selector. This currently relies heavily on Ax's best-point functionality, but this can be overridden.

There are major limitations:
* *The ideal approach for MOO isn't supported yet, so MOO isn't supported at all with inference value*: The `BenchmarkProblem` specifies `n_best_points`, how many points are returned as the best, and for MOO we would want `n_best_points > 1` and to take the hypervolume of the oracle values at those points. That is the only way it makes sense to set this up if we want to compare best-point selectors. If we use hypervolume and don't cap `n_best_points`, the ideal best-point selector would give every point. Metrics other than hypervolume, such as the fraction of "best" points actually on the Pareto frontier, would also be odd. However, there is no Ax functionality generically hooked up for getting `k` points to maximize expected hypervolume.
* Different best-point selectors can be compared by using a different `BenchmarkMethod`, either by passing different `best_point_kwargs` to the `BenchmarkMethod` or by subclassing `BenchmarkMethod` and overriding `get_best_parameters` (see the sketch below).
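To make that comparison concrete, here is a minimal, hedged sketch of a comparison harness. Only `best_point_kwargs` and `"use_model_predictions"` come from this diff; `problem`, `make_method`, `run_benchmark`, the `seeds` argument, and the `inference_value_trace` attribute access are caller-supplied stand-ins, not Ax APIs.

```python
from statistics import mean


def compare_selectors(problem, make_method, run_benchmark, seeds=range(5)):
    """Compare two best-point selectors that share the same candidate generation.

    All callables are hypothetical stand-ins: `make_method` builds a method that
    differs only in its best-point selector, and `run_benchmark` runs one
    replication and returns an object carrying an `inference_value_trace`.
    """
    results = {}
    for use_model in (False, True):
        method = make_method(best_point_kwargs={"use_model_predictions": use_model})
        traces = [
            run_benchmark(problem, method, seed=s).inference_value_trace
            for s in seeds
        ]
        # Score each selector by its mean final inference value across replications.
        results[use_model] = mean(trace[-1] for trace in traces)
    return results
```

Since the candidate-generation strategy is held fixed, any gap between the two entries of `results` is attributable to the best-point selector alone.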
## Detailed changes

### BenchmarkResult

Docstrings ought to be self-explanatory.

* The old `optimization_trace` becomes `oracle_trace`
* It always has an `inference_value_trace` as well as an `oracle_trace`
* The `optimization_trace` can be either, depending on what the `BenchmarkProblem` specifies.
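As an illustration of how the three traces relate, here is a minimal sketch of a result container. It is not the actual `BenchmarkResult` dataclass; in particular, storing the problem-level flag on the result is an illustration choice, and the field types are assumptions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BenchmarkResultSketch:
    """Illustrative stand-in for the result described above, not the Ax class."""

    name: str
    seed: int
    # Oracle value of the best point evaluated so far (the pre-existing trace).
    oracle_trace: np.ndarray
    # Oracle value of the point the method would recommend after each step.
    inference_value_trace: np.ndarray
    # Whether the problem asked for inference value to be reported as "the" trace.
    report_inference_value_as_trace: bool = False

    @property
    def optimization_trace(self) -> np.ndarray:
        # Backwards-compatible headline trace: one of the two series above,
        # selected by what the problem specifies.
        if self.report_inference_value_as_trace:
            return self.inference_value_trace
        return self.oracle_trace
```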
### `benchmark_replication`

* Computes inference value each time the scheduler generates a trial. Note that incomplete trials can thus be used, since this can happen before the trial completes.
* For MOO, should find K Pareto-optimal parameters (according to the model), get their oracle values, and get the hypervolume of those oracle values in the following manner: construct a new experiment with one BatchTrial whose arms are the K Pareto-optimal parameters and whose metrics are oracle values, and use Ax's best-point functionality to get the hypervolume. This is done to avoid re-implementing inference of objective thresholds, use of constraints, weighting, etc. HOWEVER, MOO is currently unsupported because we don't have a way of getting the K best.
* For SOO, finds the K best parameters (according to the model) and gets their oracle value (sketched below). HOWEVER, K > 1 is currently unsupported.
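A rough sketch of that SOO flow (one best point), with hypothetical stand-ins for the scheduler, method, and problem interfaces; this is not the real `benchmark_replication` code.

```python
import numpy as np


def replication_inference_trace_sketch(problem, method, scheduler) -> np.ndarray:
    """Sketch of the SOO inference-value loop with a single recommended point.

    `problem`, `method`, and `scheduler` are hypothetical objects exposing the
    behavior described above (`evaluate_oracle`, `get_best_parameters`, and a
    trial-generating scheduler); they are not actual Ax classes.
    """
    inference_trace = []
    while not scheduler.done():
        # Each iteration, the scheduler generates (and starts running) a trial.
        scheduler.generate_and_run_trial()
        # Ask the method which single point it would recommend right now; trials
        # still in flight mean this recommendation can rely on incomplete data.
        (best_params,) = method.get_best_parameters(
            experiment=scheduler.experiment, n_points=1
        )
        # Score the recommendation with the oracle: noiseless, at the target
        # task and fidelity (or however the runner defines "oracle").
        inference_trace.append(problem.evaluate_oracle(best_params))
    return np.array(inference_trace)
```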
### `BenchmarkProblem`

* Gets an attribute `report_inference_value_as_trace` that makes the `BenchmarkResult`'s `optimization_trace` be inference value when the problem specifies that inference value should be used. Docstrings should be self-explanatory.
* Adds this to `BenchmarkProblem`. Docstrings should be self-explanatory.
### `BenchmarkMethod`

* Adds a method `get_best_parameters` and an attribute `best_point_kwargs`. If not overridden, `get_best_parameters` uses `BestPointMixin._get_trace` and passes it the `best_point_kwargs`.
* Currently, the only supported argument in `best_point_kwargs` is "use_model_predictions".
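A hedged sketch of that wiring. The class below is a stand-in, not the real `BenchmarkMethod`; the generic `best_point_helper` argument is an illustration device standing in for `BestPointMixin._get_trace`, whose exact signature is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class BenchmarkMethodSketch:
    """Illustrative stand-in for the method described above."""

    name: str
    best_point_kwargs: dict[str, Any] = field(default_factory=dict)

    def get_best_parameters(
        self,
        experiment: Any,
        optimization_config: Any,
        n_points: int,
        best_point_helper: Callable[..., list[dict[str, Any]]],
    ) -> list[dict[str, Any]]:
        # Mirrors the current limitation: only SOO with a single recommended point.
        if n_points != 1:
            raise NotImplementedError("Only n_points=1 is supported for now.")
        # Default wiring: defer to a generic best-point helper and forward
        # best_point_kwargs; per this diff, "use_model_predictions" is the only
        # supported key.
        return best_point_helper(
            experiment=experiment,
            optimization_config=optimization_config,
            **self.best_point_kwargs,
        )
```

Subclasses can override `get_best_parameters` entirely to benchmark a custom selector, which is the second comparison route mentioned above.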
is "use_model_predictions".Reviewed By: Balandat
Differential Revision: D61930178